Creator: Nick Wheatley
For this demonstration, I wanted to see whether a scatterplot is a good way to determine if a team's offensive, defensive, and total efficiency correspond with its success in the tournament. See definitions of the metrics below:
$$\text{Offensive Efficiency} = 100 \times \frac{\text{Points Scored}}{\text{Possessions}}$$
$$\text{Defensive Efficiency} = 100 \times \frac{\text{Points Allowed}}{\text{Opponent Possessions}}$$
$$\text{Total Efficiency} = \frac{\text{Offensive Efficiency}}{\text{Defensive Efficiency}}$$
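As a quick worked example of these definitions, the sketch below computes the three metrics from raw totals. The function names and the season totals are made up for illustration; they are not values from barttorvik.com, and the site's published figures may include adjustments that are not modeled here.

```python
def efficiency(points: float, possessions: float) -> float:
    """Points per 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


def total_efficiency(offensive: float, defensive: float) -> float:
    """Ratio of offensive to defensive efficiency."""
    if defensive <= 0:
        raise ValueError("defensive efficiency must be positive")
    return offensive / defensive


# Hypothetical season totals, for illustration only
oe = efficiency(points=2400, possessions=2000)  # points scored per 100 possessions
de = efficiency(points=1900, possessions=2000)  # points allowed per 100 opponent possessions
te = total_efficiency(oe, de)
print(oe, de, round(te, 3))
```

A Total Efficiency above 1 means the team scores more per possession than it allows.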
These NCAA team metrics were sourced from barttorvik.com for 2017-2021 (excluding 2020, when the tournament was canceled due to COVID-19). Each team's final round and tournament seeding are also available from the website.
The get_ncaa_tournament_data function lets the user either pull live data from the website or load it from the cached CSV file.
The following steps detail the cleaning performed in get_ncaa_tournament_data on data pulled from barttorvik's site:
Compute Games Won from the Final Round column using the get_games_won function
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from get_ncaa_data import get_ncaa_tournament_data
warnings.filterwarnings("ignore")
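Before loading the data, here is a minimal sketch of how the Games Won step mentioned above could work. This is not the actual get_games_won implementation, which is not shown in this notebook. The mapping and the name games_won_sketch are assumptions; the round labels are taken from the hue_order used later in this notebook, and First Four wins are not counted, consistent with teams that exit in R68 or R64 having 0 wins.

```python
# Assumed mapping from Final Round label to games won (hypothetical;
# the real get_games_won helper may differ).
GAMES_WON_BY_ROUND = {
    "R68": 0,
    "R64": 0,
    "R32": 1,
    "Sweet Sixteen": 2,
    "Elite Eight": 3,
    "Final Four": 4,
    "Finals": 5,
    "CHAMPS": 6,
}


def games_won_sketch(final_round: str) -> int:
    """Translate the round in which a team's run ended into games won."""
    try:
        return GAMES_WON_BY_ROUND[final_round]
    except KeyError:
        raise ValueError(f"unknown Final Round label: {final_round!r}") from None
```

On a DataFrame, the same mapping could be applied with `df['Final Round'].map(GAMES_WON_BY_ROUND)`.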
Once the modules have been imported, we're ready to pull and display the data.
df = get_ncaa_tournament_data([2017,2018,2019,2021])
df.insert(0, 'Year-Team', df['Year'].map(str) + ' ' + df['Team'])  # Combine year and team name so a team that appears in several years gets a unique label
df = df.set_index('Year-Team').drop(['Year','Team'],axis=1)
df.head()
Now that the data has been loaded, let's see how the different variables correlate by using seaborn's pairplot to create a scatterplot matrix (SPLOM).
sns.set_theme() #add the gray grid backdrop
sns.pairplot(df.iloc[:,2:],corner=True) #display pairplot
Lots of moving pieces here. Let's break down some of the key takeaways.
Games Won: teams with 0 wins lose in either the round of 68 (R68) or R64, and teams with 6 wins are tournament champions.
More Games Won tends to coincide with high Total Efficiency, low Defensive Efficiency, and high Offensive Efficiency.
Let's explore subplots (1,1) and (3,2) a little more closely (TE to Games Won and OE to DE).
Using sns.regplot, we can fit a linear regression to the data and draw the trend line. Let's fit a line to subplot (1,1).
sns.regplot(x=df['Games Won'],y=df['Total Efficiency'])
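regplot draws the fitted line but does not return its coefficients. To put numbers on the trend, the sketch below computes an ordinary least-squares slope and intercept along with the Pearson correlation. The helper name linear_fit is hypothetical and the code is plain Python, separate from whatever seaborn does internally.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Least-squares fit y = slope * x + intercept, plus Pearson r."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of squared deviations and cross-deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r
```

With the data loaded above, it could be called as `slope, intercept, r = linear_fit(df['Games Won'], df['Total Efficiency'])`; a value of r near 1 would back up a strong positive relationship.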
The fitted line shows a clear positive relationship between Games Won and Total Efficiency. Because Total Efficiency is derived from Offensive Efficiency and Defensive Efficiency, let's move upstream to subplot (3,2).
sns.scatterplot(data=df, x='Offensive Efficiency', y='Defensive Efficiency')
Not super helpful. Because lower Defensive Efficiency is better and correlates with more Games Won, let's invert the y-axis so that stronger defenses sit at the top.
ax = sns.scatterplot(data=df, x='Offensive Efficiency', y='Defensive Efficiency')
ax.invert_yaxis()
Much better. However, this plot still offers little information on whether Defensive and Offensive Efficiency predict tournament champions. Let's use scatterplot's hue parameter to color-code each team by its Final Round.
fig, ax = plt.subplots()
fig.set_size_inches(12.25, 8.25)
round_order = ['R68', 'R64', 'R32', 'Sweet Sixteen', 'Elite Eight', 'Final Four', 'Finals', 'CHAMPS']
sns.scatterplot(data=df, x='Offensive Efficiency', y='Defensive Efficiency',
                hue='Final Round', hue_order=round_order, ax=ax)
ax.invert_yaxis()  # lower Defensive Efficiency is better, so it goes at the top
Effective! We now see that tournament winners are always located in the upper right quadrant, which indicates high Total Efficiency. However, this plot is still pretty crowded. What happens if we create a color-coded bubble chart, using hue and size to scale with Games Won? Let's add star markers for champions and a title while we're at it.
fig, ax = plt.subplots()
fig.set_size_inches(12.25, 8.25)
markers = {0:'o',1:'o',2:'o',3:'o',4:'o',5:'o',6:'*'}
sizes = {0:20,1:40,2:60,3:80,4:100,5:200,6:400}
sns.scatterplot(x='Offensive Efficiency',y='Defensive Efficiency',style='Games Won',
markers=markers,size='Games Won',sizes=sizes,hue='Games Won',palette='Reds',data=df,ax=ax,legend='full')
ax.invert_yaxis()
ax.legend(title='Games Won')
leg = ax.get_legend()
leg.get_texts()[-1].set_text('Champion') #Change name of last label to display 'Champion'
ax.set_title("'17-'21 NCAA Tournament Teams")
Now we're talking! We clearly see that teams outside the top right quadrant rarely get beyond the first round, while teams with a high Offensive Efficiency and a low Defensive Efficiency usually make it to the final rounds. It further appears that a high OE has a greater impact than a low DE. A team with an OE of 120+ usually makes it to at least the Final Four so long as it has a DE below 95.
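The threshold claim above is read off the chart, so it is worth checking against the data. The sketch below counts the teams that meet the OE and DE cutoffs and reports what share of them won at least a given number of games. The helper name share_reaching is hypothetical, and min_wins=4 assumes that a Final Four exit corresponds to 4 Games Won.

```python
import pandas as pd


def share_reaching(df: pd.DataFrame, min_oe: float = 120.0,
                   max_de: float = 95.0, min_wins: int = 4):
    """Return (matching team count, share of them with >= min_wins Games Won)."""
    # Teams with OE at or above the cutoff and DE strictly below the cutoff
    mask = (df["Offensive Efficiency"] >= min_oe) & (df["Defensive Efficiency"] < max_de)
    matched = df.loc[mask, "Games Won"]
    if matched.empty:
        return 0, float("nan")
    return len(matched), float((matched >= min_wins).mean())
```

Calling `share_reaching(df)` on the tournament data would show how often "usually" actually holds, and how many teams the estimate rests on.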
It would be nice to see which marker represents which team, but this is where seaborn reaches its limit. To get hover tooltips, we need to introduce plotly. plotly.express lets us recreate the seaborn graphs above and adds interactivity.
import plotly.express as px
markers = {0:'circle',1:'circle',2:'circle',3:'circle',4:'circle',5:'circle',6:'star'}  # star marks champions (6 wins)
fig = px.scatter(df, x="Offensive Efficiency", y="Defensive Efficiency", color="Games Won",hover_name=df.index,hover_data=['Total Efficiency','Final Round'],
color_continuous_scale='Reds',title="'17-'21 NCAA Tournament Teams",symbol='Games Won',symbol_map=markers,size='Games Won',size_max=15)
fig.update_yaxes(autorange="reversed")
fig.update_traces(showlegend=False) #remove duplicate legend
fig.write_html("2017_2021_ncaa_tournament_teams.html")
fig.show()
Now we have the visual aspects of seaborn combined with the interactivity of plotly. Hovering over a marker now shows the team name, its Total Efficiency, and its Final Round in addition to the other variables.
Now, for fun, let's see how this looks for the ongoing 2022 NCAA Tournament. Because it's ongoing, let's use an X marker for teams that have lost and a circle marker for teams still playing. Let's also scale hue and size by Total Efficiency.
df22 = pd.read_csv('ncaa_tournament_teams_2022.csv')
df22.head()
markers = {'✅':'circle','❌':'x-open'}
fig = px.scatter(df22, x="Offensive Efficiency", y="Defensive Efficiency",hover_name='Team',hover_data=['Final Round'],
title="'22 NCAA Tournament Teams To Date",symbol='Final Round',symbol_map=markers,size='Total Efficiency',size_max=10,
color='Total Efficiency',color_continuous_scale='Reds')
fig.update_yaxes(autorange="reversed")
fig.update_traces(showlegend=False)
fig.show()
There we have it! The historical pattern flags Gonzaga as the clear favorite, with a likely Final Four of Gonzaga, Kansas, Houston, and Purdue. Because OE and DE are largely calculated from each team's regular-season games, teams from a weaker conference may have inflated numbers from playing less competitive opponents. Gonzaga has been the favorite in multiple years but has yet to claim a title. In future exploration, one might weight the efficiency metrics by conference strength to see how that changes the projected outcomes.